Abstract

The analysis of online least squares estimation is at the heart of many stochastic sequential decision-making problems. We employ tools from the theory of self-normalized processes to provide a simple and self-contained proof of a tail bound for a vector-valued martingale. We use the bound to construct new, tighter confidence sets for the least squares estimate.

We apply the confidence sets to several online decision problems, such as the multi-armed and the linearly parametrized bandit problems. The confidence sets are potentially applicable to other problems such as sleeping bandits, generalized linear bandits, and other linear control problems.

We improve the regret bound of the Upper Confidence Bound (UCB) algorithm of Auer et al. (2002) and show that its regret is, with high probability, a problem dependent constant. In the case of linear bandits (Dani et al., 2008), we improve the problem dependent bound in the dimension and number of time steps. Furthermore, as opposed to the previous result, we prove that our bound holds for small sample sizes, and at the same time the worst case bound is improved by a logarithmic factor and the constant is improved.

1 Introduction

The least squares method forms a cornerstone of statistics and machine learning. It is used as the main component of many stochastic sequential decision problems, such as multi-armed bandits, linear bandits, and other linear control problems. However, the analysis of least squares in these online settings is non-trivial because of the correlations between data points. Fortunately, there is a connection between online least squares estimation and the area of self-normalized processes. The study of self-normalized processes has a long history that goes back to Student and is treated in detail in the recent book by de la Peña et al. (2009). Using these tools we provide a proof of a bound on the deviation for vector-valued martingales. A less general version of the bound can be found already in de la Peña et al. (2004, 2003). Additionally, our proof, based on the method of mixtures, is new, simpler and self-contained. The bound improves the previous bound of Rusmevichientong and Tsitsiklis (2010) and it is applicable to virtually any online least squares problem.

The bound that we derive immediately gives rise to tight confidence sets for the online least squares estimate that can replace the confidence sets in existing algorithms. In particular, the confidence sets can be used in the UCB algorithm for the multi-armed bandit problem, the ConfidenceBall algorithm of Dani et al. (2008) for the linear bandit problem, and the LinRel algorithm of Auer (2003) for the associative reinforcement learning problem. We show that this leads to improved performance of these algorithms. Our hope is that the new confidence sets can be used to improve the performance of other similar linear decision problems.

The multi-armed bandit problem, introduced by Robbins (1952), is a game between the learner and the environment. At each time step, the learner chooses one of $K$ actions and receives a reward which is generated independently at random from a fixed distribution associated with the chosen arm. The objective of the learner is to maximize his total reward. The performance of the learner is evaluated by the regret, which is defined as the difference between his total reward and the total reward of the best action. Lai and Robbins (1985) prove a $\left(\sum_{i \ne i_*} \frac{\Delta_i}{D(p_i, p_*)} - o(1)\right)\log T$ lower bound on the expected regret of any algorithm, where $T$ is the number of time steps, $p_*$ and $p_i$ are the reward distributions of the optimal arm and arm $i$ respectively, $\Delta_i$ is the difference between their expected rewards, and $D$ is the KL-divergence.

Auer et al. (2002) designed the UCB algorithm and proved a finite-time logarithmic bound on its regret. They used Hoeffding's inequality to construct confidence intervals and obtained an $O((K \log T)/\Delta)$ bound on the expected regret, where $\Delta$ is the difference between the expected rewards of the best and the second best action. We modify UCB so that it uses our new confidence sets and we show a stronger result. Namely, we show that with probability $1 - \delta$, the regret of the modified algorithm is $O(K \log(1/\delta)/\Delta)$. Seemingly, this result contradicts the lower bound of Lai and Robbins (1985); however, our algorithm depends on $\delta$, which it receives as an input. The expected regret of the modified algorithm with $\delta = 1/T$ matches the regret of the original algorithm.

In the linear bandit problem, the learner repeatedly chooses actions from a fixed subset of $\mathbb{R}^d$ and receives a random reward, the expectation of which is a linear function of the action. Dani et al. (2008) proposed the ConfidenceBall algorithm and showed that its regret is at most $O(d \log(T) \sqrt{T \log(T/\delta)})$ with probability at least $1 - \delta$. We modify their algorithm so that it uses our new confidence sets and we show that its regret is at most $O(d \log(T) \sqrt{T} + \sqrt{dT \log(T/\delta)})$. Additionally, constants in our bound are smaller, and our bound holds for all $T \ge 1$, as opposed to the previous one which holds only for sufficiently large $T$. Dani et al. (2008) also prove a problem dependent regret bound. Namely, they show that the regret of their algorithm is $O\left(\frac{d^2}{\Delta} \log^2 T \log(T/\delta)\right)$, where $\Delta$ is the "gap" as defined in (Dani et al., 2008). For our modified algorithm we prove an improved $O\left(\frac{\log(1/\delta)}{\Delta} (\log T + d \log\log T)^2\right)$ bound.

1.1 Notation 

We use $\|\cdot\|$ to denote the 2-norm. For a positive definite matrix $A \in \mathbb{R}^{d \times d}$, the weighted 2-norm is defined by $\|x\|_A^2 = x^\top A x$, where $x \in \mathbb{R}^d$. The inner product is denoted by $\langle \cdot, \cdot \rangle$ and the weighted inner product by $x^\top A y = \langle x, y \rangle_A$. We use $\lambda_{\min}(A)$ to denote the minimum eigenvalue of the positive definite matrix $A$. We use $A \succ 0$ to denote that $A$ is positive definite, while we use $A \succeq 0$ to denote that it is positive semidefinite. The same notation is used to denote the Loewner partial order of matrices. We shall use $e_i$ to denote the $i$-th unit vector, i.e., for all $j \ne i$, $e_{ij} = 0$ and $e_{ii} = 1$.

2 Vector-Valued Martingale Tail Inequalities

Let $(\mathcal{F}_k; k \ge 0)$ be a filtration, $(m_k; k \ge 0)$ be an $\mathbb{R}^d$-valued stochastic process adapted to $(\mathcal{F}_k)$, and $(\eta_k; k \ge 1)$ be a real-valued martingale difference process adapted to $(\mathcal{F}_k)$. Assume that $\eta_k$ is conditionally sub-Gaussian in the sense that there exists some $R > 0$ such that for any $\gamma \in \mathbb{R}$, $k \ge 1$,

$$\mathbb{E}[\exp(\gamma \eta_k) \mid \mathcal{F}_{k-1}] \le \exp\left(\frac{\gamma^2 R^2}{2}\right) \quad \text{a.s.} \qquad (1)$$
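As a small sanity check of condition (1), the following sketch verifies numerically that a Rademacher variable (uniform on $\{-1, +1\}$) is sub-Gaussian with $R = 1$. The Rademacher example and the function names are illustrative assumptions, not taken from the text.

```python
import math

def rademacher_mgf(gamma: float) -> float:
    """E[exp(gamma * eta)] for eta uniform on {-1, +1}, which equals cosh(gamma)."""
    return math.cosh(gamma)

def sub_gaussian_envelope(gamma: float, R: float = 1.0) -> float:
    """Right-hand side of condition (1): exp(gamma^2 R^2 / 2)."""
    return math.exp(gamma * gamma * R * R / 2.0)

# Check the inequality on a grid of gamma values in [-5, 5].
gammas = [g / 10.0 for g in range(-50, 51)]
holds = all(rademacher_mgf(g) <= sub_gaussian_envelope(g) for g in gammas)
print(holds)
```

The check only covers a finite grid; the inequality $\cosh(\gamma) \le \exp(\gamma^2/2)$ itself holds for all real $\gamma$.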

Consider the martingale

$$S_t = \sum_{k=1}^{t} \eta_k m_{k-1} \qquad (2)$$

and the matrix-valued processes

$$V_t = \sum_{k=1}^{t} m_{k-1} m_{k-1}^\top, \qquad \bar V_t = V + V_t, \qquad t \ge 0, \qquad (3)$$

where $V$ is an $\mathcal{F}_0$-measurable, positive definite matrix. In particular, assume that with probability one, the eigenvalues of $V$ are larger than $\lambda_0 > 0$ and that $\|m_k\| \le L$ holds a.s. for any $k \ge 0$. The following standard inequality plays a crucial role in the subsequent developments:

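Lemma 1. Consider $(\eta_t)$, $(m_t)$ as defined above and let $\tau$ be a stopping time with respect to the filtration $(\mathcal{F}_t)$. Let $\lambda \in \mathbb{R}^d$ be arbitrary and consider

$$P_t^\lambda = \exp\left(\sum_{k=1}^{t} \left[\frac{\eta_k \langle \lambda, m_{k-1} \rangle}{R} - \frac{1}{2} \langle \lambda, m_{k-1} \rangle^2\right]\right).$$

Then $P_\tau^\lambda$ is almost surely well-defined and

$$\mathbb{E}[P_\tau^\lambda] \le 1.$$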
Proof. The proof is standard (and is given only for the sake of completeness). We claim that $P_t = P_t^\lambda$ is a supermartingale. Let

$$D_k = \exp\left(\frac{\eta_k \langle \lambda, m_{k-1} \rangle}{R} - \frac{1}{2} \langle \lambda, m_{k-1} \rangle^2\right).$$

Observe that by (1), we have $\mathbb{E}[D_k \mid \mathcal{F}_{k-1}] \le 1$. Clearly, $D_k$ is $\mathcal{F}_k$-adapted, as is $P_k$. Further,

$$\mathbb{E}[P_t \mid \mathcal{F}_{t-1}] = \mathbb{E}[D_1 \cdots D_{t-1} D_t \mid \mathcal{F}_{t-1}] = D_1 \cdots D_{t-1}\, \mathbb{E}[D_t \mid \mathcal{F}_{t-1}] \le P_{t-1},$$

showing that $(P_t)$ is indeed a supermartingale.

Now, this immediately leads to the desired result when $\tau = t$ for some deterministic time $t$. This is based on the fact that the mean of any supermartingale can be bounded by the mean of its first element. In the case of $(P_t)$, for example, we have $\mathbb{E}[P_t] = \mathbb{E}[\mathbb{E}[P_t \mid \mathcal{F}_{t-1}]] \le \mathbb{E}[P_{t-1}] \le \cdots \le \mathbb{E}[P_0] = 1$.

Now, in order to consider the general case, let $S_t = P^\lambda_{\tau \wedge t}$.¹ It is well known that $(S_t)$ is still a supermartingale with $\mathbb{E}[S_t] \le \mathbb{E}[S_0] = \mathbb{E}[P_0^\lambda] = 1$. Further, since $P_t^\lambda$ was nonnegative, so is $S_t$. Hence, by the convergence theorem for nonnegative supermartingales, almost surely, $\lim_{t \to \infty} S_t$ exists, i.e., $P_\tau^\lambda$ is almost surely well-defined. Further, $\mathbb{E}[P_\tau^\lambda] = \mathbb{E}[\liminf_{t \to \infty} S_t] \le \liminf_{t \to \infty} \mathbb{E}[S_t] \le 1$ by Fatou's Lemma. □

Before stating our main results, we give some recent results, which can essentially be extracted from the paper by Rusmevichientong and Tsitsiklis (2010).

Theorem 2. Consider the processes $(S_t)$, $(\bar V_t)$ as defined above and let

$$\kappa = \sqrt{3 + 2 \log((L^2 + \mathrm{trace}(V))/\lambda_0)}. \qquad (4)$$

Then, for any $0 < \delta < 1$, $t \ge 2$, with probability at least $1 - \delta$,

$$\|S_t\|_{\bar V_t^{-1}} \le 2 \kappa^2 R \sqrt{\log t}\, \sqrt{d \log(t) + \log(1/\delta)}. \qquad (5)$$

We now show how to strengthen the previous result using the method of mixtures, originally used by Robbins and Siegmund (1970) to evaluate boundary crossing probabilities for Brownian motion.

Theorem 3 (Self-normalized bound for vector-valued martingales). Let $(\eta_t)$, $(m_t)$, $(S_t)$, $(\bar V_t)$, and $(\mathcal{F}_t)$ be as before and let $\tau$ be a stopping time with respect to the filtration $(\mathcal{F}_t)$. Assume that $V$ is deterministic. Then, for any $0 < \delta < 1$, with probability at least $1 - \delta$,

$$\|S_\tau\|^2_{\bar V_\tau^{-1}} \le 2 R^2 \log\left(\frac{\det(\bar V_\tau)^{1/2} \det(V)^{-1/2}}{\delta}\right). \qquad (6)$$

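Proof. Without loss of generality, assume that $R = 1$ (by appropriately scaling $S_t$, this can always be achieved). Let

$$M_t^\lambda = \exp\left(\langle \lambda, S_t \rangle - \tfrac{1}{2} \|\lambda\|^2_{V_t}\right).$$

Notice that by Lemma 1, the mean of $M_\tau^\lambda$ is not larger than one.

Let $\Lambda$ be a Gaussian random variable which is independent of all the other random variables and whose covariance is $V^{-1}$. Define

$$M_t = \mathbb{E}[M_t^\Lambda \mid \mathcal{F}_\infty].$$

Clearly, we still have $\mathbb{E}[M_\tau] = \mathbb{E}[\mathbb{E}[M_\tau^\Lambda \mid \Lambda]] \le 1$.

Let us calculate $M_t$: Let $f$ denote the density of $\Lambda$ and for a positive definite matrix $P$ let $c(P) = \sqrt{(2\pi)^d / \det(P)} = \int \exp(-\tfrac{1}{2} x^\top P x)\, dx$. Then,

$$M_t = \int \exp\left(\langle \lambda, S_t \rangle - \tfrac{1}{2} \|\lambda\|^2_{V_t}\right) f(\lambda)\, d\lambda$$
$$= \int \exp\left(-\tfrac{1}{2} \|\lambda - V_t^{-1} S_t\|^2_{V_t} + \tfrac{1}{2} \|S_t\|^2_{V_t^{-1}}\right) f(\lambda)\, d\lambda$$
$$= \frac{1}{c(V)} \exp\left(\tfrac{1}{2} \|S_t\|^2_{V_t^{-1}}\right) \int \exp\left(-\tfrac{1}{2} \left\{\|\lambda - V_t^{-1} S_t\|^2_{V_t} + \|\lambda\|^2_V\right\}\right) d\lambda.$$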



¹ $\tau \wedge t$ is a shorthand notation for $\min(\tau, t)$.
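Elementary calculation shows that if $P \succeq 0$, $Q \succ 0$,

$$\|x - a\|_P^2 + \|x\|_Q^2 = \|x - (P + Q)^{-1} P a\|^2_{P+Q} + \|a\|_P^2 - \|P a\|^2_{(P+Q)^{-1}}.$$

Therefore,

$$\|\lambda - V_t^{-1} S_t\|^2_{V_t} + \|\lambda\|^2_V = \|\lambda - (V + V_t)^{-1} S_t\|^2_{V + V_t} + \|V_t^{-1} S_t\|^2_{V_t} - \|S_t\|^2_{(V + V_t)^{-1}}$$
$$= \|\lambda - (V + V_t)^{-1} S_t\|^2_{V + V_t} + \|S_t\|^2_{V_t^{-1}} - \|S_t\|^2_{(V + V_t)^{-1}},$$

which gives

$$M_t = \frac{1}{c(V)} \exp\left(\tfrac{1}{2} \|S_t\|^2_{(V + V_t)^{-1}}\right) \int \exp\left(-\tfrac{1}{2} \|\lambda - (V + V_t)^{-1} S_t\|^2_{V + V_t}\right) d\lambda$$
$$= \frac{c(V + V_t)}{c(V)} \exp\left(\tfrac{1}{2} \|S_t\|^2_{(V + V_t)^{-1}}\right) = \left(\frac{\det(V)}{\det(V + V_t)}\right)^{1/2} \exp\left(\tfrac{1}{2} \|S_t\|^2_{(V + V_t)^{-1}}\right).$$

Now, from $\mathbb{E}[M_\tau] \le 1$, we obtain

$$\mathbb{P}\left[\|S_\tau\|^2_{(V + V_\tau)^{-1}} > 2 \log\left(\frac{\det(V + V_\tau)^{1/2}}{\delta \det(V)^{1/2}}\right)\right] = \mathbb{P}\left[\frac{\exp\left(\tfrac{1}{2} \|S_\tau\|^2_{(V + V_\tau)^{-1}}\right)}{\delta^{-1} \left(\det(V + V_\tau)/\det(V)\right)^{1/2}} > 1\right]$$
$$\le \mathbb{E}\left[\frac{\exp\left(\tfrac{1}{2} \|S_\tau\|^2_{(V + V_\tau)^{-1}}\right)}{\delta^{-1} \left(\det(V + V_\tau)/\det(V)\right)^{1/2}}\right] = \mathbb{E}[M_\tau]\, \delta \le \delta,$$

where the inequality is Markov's inequality, thus finishing the proof. □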

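Corollary 1 (Uniform Bound). Under the same assumptions as in the previous theorem, for any $0 < \delta < 1$, with probability at least $1 - \delta$,

$$\forall t \ge 0, \qquad \|S_t\|^2_{\bar V_t^{-1}} \le 2 R^2 \log\left(\frac{\det(\bar V_t)^{1/2} \det(V)^{-1/2}}{\delta}\right). \qquad (7)$$

Proof. We will use a stopping time construction, which goes back at least to Freedman (1975). Define the bad event

$$B_t(\delta) = \left\{\omega \in \Omega : \|S_t\|^2_{\bar V_t^{-1}} > 2 R^2 \log\left(\frac{\det(\bar V_t)^{1/2} \det(V)^{-1/2}}{\delta}\right)\right\}. \qquad (8)$$

We are interested in bounding the probability that $\bigcup_{t \ge 0} B_t(\delta)$ happens. Define $\tau(\omega) = \min\{t \ge 0 : \omega \in B_t(\delta)\}$, with the convention that $\min \emptyset = \infty$. Then, $\tau$ is a stopping time. Further,

$$\bigcup_{t \ge 0} B_t(\delta) = \{\omega : \tau(\omega) < \infty\}.$$

Thus, by Theorem 3,

$$\mathbb{P}\left[\bigcup_{t \ge 0} B_t(\delta)\right] = \mathbb{P}[\tau < \infty]$$
$$= \mathbb{P}\left[\|S_\tau\|^2_{\bar V_\tau^{-1}} > 2 R^2 \log\left(\frac{\det(\bar V_\tau)^{1/2} \det(V)^{-1/2}}{\delta}\right),\ \tau < \infty\right]$$
$$\le \mathbb{P}\left[\|S_\tau\|^2_{\bar V_\tau^{-1}} > 2 R^2 \log\left(\frac{\det(\bar V_\tau)^{1/2} \det(V)^{-1/2}}{\delta}\right)\right] \le \delta.$$

□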
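To make the right-hand side of (7) concrete, here is a minimal sketch that evaluates it from the two matrices. The helper names and the example matrices are illustrative assumptions; the log-determinant is computed with a hand-written Cholesky factorization so that only the standard library is needed.

```python
import math

def log_det(matrix):
    """Log-determinant of a symmetric positive definite matrix via Cholesky."""
    n = len(matrix)
    chol = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                chol[i][j] = math.sqrt(s)
                total += 2.0 * math.log(chol[i][j])
            else:
                chol[i][j] = s / chol[j][j]
    return total

def self_normalized_radius_sq(v_bar, v, delta, R=1.0):
    """Right-hand side of (7): 2 R^2 log(det(v_bar)^{1/2} det(v)^{-1/2} / delta)."""
    return 2.0 * R * R * (0.5 * log_det(v_bar) - 0.5 * log_det(v) + math.log(1.0 / delta))

# Hypothetical example: V = I and V_bar = V + (sum of outer products) with determinant 5.
v = [[1.0, 0.0], [0.0, 1.0]]
v_bar = [[3.0, 1.0], [1.0, 2.0]]
radius_sq = self_normalized_radius_sq(v_bar, v, delta=0.05)
print(radius_sq)
```

Working with log-determinants rather than determinants avoids overflow when $t$ is large.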
Let us now turn our attention to understanding the determinant term on the right-hand side of (7).

Lemma 4. We have that

$$\det(\bar V_t) = \det(V) \prod_{k=1}^{t} \left(1 + \|m_{k-1}\|^2_{\bar V_{k-1}^{-1}}\right).$$

Further, we have that

$$\sum_{k=1}^{t} \left(\|m_{k-1}\|^2_{\bar V_{k-1}^{-1}} \wedge 1\right) \le 2 \left(\log\det(\bar V_t) - \log\det V\right) \le 2 \left(d \log((\mathrm{trace}(V) + t L^2)/d) - \log\det V\right).$$

Finally, if $\lambda_0 \ge \max(1, L^2)$ then

$$\sum_{k=1}^{t} \|m_{k-1}\|^2_{\bar V_{k-1}^{-1}} \le 2 \log\frac{\det(\bar V_t)}{\det(V)}.$$

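Proof. Elementary algebra gives

$$\det(\bar V_t) = \det(\bar V_{t-1} + m_{t-1} m_{t-1}^\top) = \det(\bar V_{t-1}) \det\left(I + \bar V_{t-1}^{-1/2} m_{t-1} (\bar V_{t-1}^{-1/2} m_{t-1})^\top\right)$$
$$= \det(\bar V_{t-1}) \left(1 + \|m_{t-1}\|^2_{\bar V_{t-1}^{-1}}\right) = \det(V) \prod_{k=1}^{t} \left(1 + \|m_{k-1}\|^2_{\bar V_{k-1}^{-1}}\right), \qquad (9)$$

where we used that all the eigenvalues of a matrix of the form $I + x x^\top$ are one except one eigenvalue, which is $1 + \|x\|^2$ and which corresponds to the eigenvector $x$. Using $\log(1 + t) \le t$, we can bound $\log\det(\bar V_t)$ by

$$\log\det(\bar V_t) \le \log\det V + \sum_{k=1}^{t} \|m_{k-1}\|^2_{\bar V_{k-1}^{-1}}.$$

Combining $x \le 2 \log(1 + x)$, which holds when $x \in [0, 1]$, and (9), we get

$$\sum_{k=1}^{t} \left(\|m_{k-1}\|^2_{\bar V_{k-1}^{-1}} \wedge 1\right) \le 2 \sum_{k=1}^{t} \log\left(1 + \|m_{k-1}\|^2_{\bar V_{k-1}^{-1}}\right) = 2 \left(\log\det(\bar V_t) - \log\det V\right).$$

The trace of $\bar V_t$ is bounded by $\mathrm{trace}(V) + t L^2$, assuming $\|m_k\| \le L$. Hence, by the inequality of arithmetic and geometric means applied to the eigenvalues $\lambda_1, \dots, \lambda_d$ of $\bar V_t$, $\det(\bar V_t) = \prod_{i=1}^{d} \lambda_i \le \left(\frac{\mathrm{trace}(V) + t L^2}{d}\right)^d$ and therefore,

$$\log\det(\bar V_t) \le d \log((\mathrm{trace}(V) + t L^2)/d),$$

finishing the proof of the second inequality. The sum $\sum_{k=1}^{t} \|m_{k-1}\|^2_{\bar V_{k-1}^{-1}}$ can itself be upper bounded as a function of $\log\det(\bar V_t)$ provided that $\lambda_0$ is large enough. Notice that $\|m_{k-1}\|^2_{\bar V_{k-1}^{-1}} \le \lambda_{\min}^{-1}(\bar V_{k-1}) \|m_{k-1}\|^2 \le L^2/\lambda_0$. Hence, we get that if $\lambda_0 \ge \max(1, L^2)$,

$$\log\frac{\det(\bar V_t)}{\det(V)} \le \sum_{k=1}^{t} \|m_{k-1}\|^2_{\bar V_{k-1}^{-1}} \le 2 \log\frac{\det(\bar V_t)}{\det(V)}.$$

□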
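The determinant identity (9) and the first inequality of Lemma 4 are easy to check numerically. The sketch below does so in dimension two with $V = I$ and randomly drawn vectors $m_k$; the random data, the seed, and the helper names are assumptions made purely for illustration.

```python
import math
import random

def det2(a):
    """Determinant of a 2x2 matrix."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

def inv_quad_form(a, m):
    """m^T a^{-1} m for a 2x2 symmetric positive definite matrix a."""
    d = det2(a)
    return (a[1][1] * m[0] ** 2 - 2.0 * a[0][1] * m[0] * m[1] + a[0][0] * m[1] ** 2) / d

rng = random.Random(0)
v = [[1.0, 0.0], [0.0, 1.0]]            # V = I
v_bar = [row[:] for row in v]           # running V_bar_{k-1}
product = 1.0                           # running product in (9)
clipped_sum = 0.0                       # running sum of min(||m||^2, 1)
for _ in range(200):
    m = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    w = inv_quad_form(v_bar, m)         # squared norm of m in the V_bar_{k-1}^{-1} norm
    product *= 1.0 + w
    clipped_sum += min(w, 1.0)
    for i in range(2):                  # rank-one update V_bar_k = V_bar_{k-1} + m m^T
        for j in range(2):
            v_bar[i][j] += m[i] * m[j]

lhs = det2(v_bar)
rhs = det2(v) * product
log_det_gap = math.log(det2(v_bar)) - math.log(det2(v))
print(lhs, rhs, clipped_sum, 2.0 * log_det_gap)
```

The same rank-one recursion is what makes it cheap to track $\log\det(\bar V_t)$ online: each step adds $\log(1 + \|m_{k-1}\|^2_{\bar V_{k-1}^{-1}})$.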

Most of this argument can be extracted from the paper of Dani et al. (2008). However, the idea goes back at least to Lai et al. (1979), Lai and Wei (1982) (a similar argument is used around Theorem 11.7 in the book by Cesa-Bianchi and Lugosi (2006)). Note that Lemmas B.9-B.11 of Rusmevichientong and Tsitsiklis (2010) also give a bound on $\sum_{k=1}^{t} \|m_{k-1}\|_{\bar V_{k-1}^{-1}}$, with an essentially identical argument. Alternatively, one can use the bounding technique of Auer (2003) (see the proof of Lemma 13 there on pages 412-413) to derive a bound like $\sum_{k=1}^{t} \|m_{k-1}\|^2_{\bar V_{k-1}^{-1}} \le C d \log t$ for a suitably chosen constant $C > 0$.
Remark 5. By combining Corollary 1 and Lemma 4 we get a simple worst case bound that holds with probability $1 - \delta$:

$$\forall t \ge 0, \qquad \|S_t\|^2_{\bar V_t^{-1}} \le R^2 \left(d \log\frac{\mathrm{trace}(V) + t L^2}{d} - \log\det V + 2 \log\frac{1}{\delta}\right). \qquad (10)$$

Still, the new bound is considerably better than the previous one given by Theorem 2. Note that the $\log(t)$ factor cannot be removed, as shown by Problem 3, page 203 in the book by de la Peña et al. (2009).

3 Optional Skipping 

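Consider the case when $d = 1$, $m_k = \epsilon_k \in \{0, 1\}$, i.e., the case of an optional skipping process. Then, using again $V = I = 1$, $\bar V_t = 1 + \sum_{k=1}^{t} \epsilon_{k-1} \doteq 1 + N_t$ and thus the expression studied becomes

$$\|S_t\|_{\bar V_t^{-1}} = \frac{\left|\sum_{k=1}^{t} \epsilon_{k-1} \eta_k\right|}{\sqrt{1 + N_t}}.$$

We also have

$$\log\det(\bar V_t) = \sum_{k=1}^{t} \log\left(1 + \frac{\epsilon_{k-1}}{1 + N_{k-1}}\right) \le \sum_{k=1}^{t} \frac{\epsilon_{k-1}}{1 + N_{k-1}} = \sum_{k=1}^{N_t} \frac{1}{k} \le 1 + \int_{1}^{1 + N_t} x^{-1}\, dx = 1 + \log(1 + N_t).$$

Thus, we get, with probability $1 - \delta$,

$$\forall s \ge 0, \qquad \left|\sum_{k=1}^{s} \epsilon_{k-1} \eta_k\right| \le \sqrt{(1 + N_s) \left(1 + 2 \log\left(\frac{(1 + N_s)^{1/2}}{\delta}\right)\right)}. \qquad (11)$$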



If we apply Doob's optional skipping and Hoeffding-Azuma, with a union bound (see, e.g., the paper of Bubeck et al. (2008)), we would get, for any $0 < \delta < 1$, $t \ge 2$, with probability $1 - \delta$,

$$\forall s \in \{0, \dots, t\}, \qquad \left|\sum_{k=1}^{s} \epsilon_{k-1} \eta_k\right| \le \sqrt{2 N_s \log\left(\frac{2t}{\delta}\right)}. \qquad (12)$$

The major difference between these bounds is that (12) depends explicitly on $t$, while (11) does not. This has the positive effect that one need not recompute the bound if $N_t$ does not grow, which helps, e.g., in the paper of Bubeck et al. (2008) to improve the computational complexity of the HOO algorithm. Also, the coefficient of the leading term in (11) under the square root is 1, whereas in (12) it is 2.
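The contrast between the two bounds can be seen by evaluating them for a fixed count $N_s$ and a growing horizon $t$. The sketch below assumes $R = 1$ and assumes that the union-bound inequality (12) has the form $\sqrt{2 N_s \log(2t/\delta)}$; the function names and the chosen numbers are illustrative only.

```python
import math

def self_normalized_bound(n_s, delta):
    """Right-hand side of (11); depends on the count N_s but not on t."""
    return math.sqrt((1 + n_s) * (1 + 2 * math.log(math.sqrt(1 + n_s) / delta)))

def union_bound(n_s, t, delta):
    """Assumed form of (12): sqrt(2 N_s log(2 t / delta)); grows with the horizon t."""
    return math.sqrt(2 * n_s * math.log(2 * t / delta))

delta = 0.05
n_s = 100
for t in (10**3, 10**6, 10**9):
    # The first column stays fixed while the second keeps growing with t.
    print(t, self_normalized_bound(n_s, delta), union_bound(n_s, t, delta))
```

For small $N_s$ and short horizons the comparison can go the other way, so this is an illustration of the dependence on $t$ rather than a uniform ranking of the two bounds.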

Instead of a union bound, it is possible to use a "peeling device" to replace the conservative $\log t$ factor in the above bound by essentially $\log\log t$. This is done, e.g., in Garivier and Moulines (2008) in their Theorem 22.² From their derivations, the following one-sided, uniform bound can be extracted (see Remark 24, page 19): For any $0 < \delta < 1$, $t \ge 2$, with probability $1 - \delta$,



yse{0,...,t}, Y:^e,^.Vk<^'^^lo,[^). (13) 

As noted by Garivier and Moulines (2008), due to the law of the iterated logarithm, the scaling of the right-hand side as a function of $t$ cannot be improved in the worst case. However, this leaves open the possibility of deriving a maximal inequality which depends on $t$ only through $N_t$.

4 The Multi-Armed Bandit Problem

Now we turn our attention to the multi-armed bandit problem. Let $\mu_i$ denote the expected reward of action $i$ and $\Delta_i = \mu_* - \mu_i$, where $\mu_*$ is the expected reward of the optimal action. We assume that if we choose action $I_t$ in round $t$, we obtain reward $\mu_{I_t} + \eta_t$. Let $N_{i,t}$ denote the number of times that we have played action $i$ up to time $t$, and $\bar X_{i,t}$ denote the average of the rewards received by action $i$ up to time $t$. From (11) with $\delta/K$ instead of $\delta$ and a union bound over the actions, we have the following confidence intervals that hold with probability at least $1 - \delta$:

$$\forall i \in \{1, \dots, K\},\ \forall s \in \{1, 2, \dots\}, \qquad |\bar X_{i,s} - \mu_i| \le c_{i,s}, \qquad (14)$$



² They state their theorem in terms of ratios, which is problematic since their inequality then fails to hold for $N_t = 0$. However, this is easy to remedy by reformulating their result as we do here.
where

$$c_{i,s} = \sqrt{\frac{1 + N_{i,s}}{N_{i,s}^2} \left(1 + 2 \log\left(\frac{K (1 + N_{i,s})^{1/2}}{\delta}\right)\right)}.$$

Modify the UCB algorithm of Auer et al. (2002) to use these confidence intervals and change the action selection rule accordingly. Hence, at time $t$, we choose the action

$$I_t = \arg\max_i\ \bar X_{i,t} + c_{i,t}. \qquad (15)$$

We call this algorithm UCB($\delta$).
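The selection rule (15) can be sketched in a few lines. The simulation below is a hedged illustration: it assumes Gaussian rewards with unit variance (so $R = 1$), treats the width of an arm that has never been pulled as infinite so that every arm is tried once, and uses hypothetical arm means and function names.

```python
import math
import random

def confidence_width(n, num_arms, delta):
    """c_{i,s} for an arm pulled n times; infinite before the first pull."""
    if n == 0:
        return math.inf
    log_term = 1.0 + 2.0 * math.log(num_arms * math.sqrt(1.0 + n) / delta)
    return math.sqrt((1.0 + n) / n**2 * log_term)

def run_ucb_delta(means, delta, horizon, noise_std=1.0, seed=0):
    """Play UCB(delta) on Gaussian arms and return the pull count of each arm."""
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    sums = [0.0] * k
    for _ in range(horizon):
        def index(i):
            avg = sums[i] / pulls[i] if pulls[i] else 0.0
            return avg + confidence_width(pulls[i], k, delta)
        arm = max(range(k), key=index)                 # rule (15)
        reward = means[arm] + rng.gauss(0.0, noise_std)
        pulls[arm] += 1
        sums[arm] += reward
    return pulls

pulls = run_ucb_delta(means=[0.2, 0.8], delta=0.05, horizon=5000)
print(pulls)
```

Note that the width depends on the round only through the pull count of the arm, so an arm's index changes only when that arm is played.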

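Theorem 6. With probability at least $1 - \delta$, the total regret of the UCB($\delta$) algorithm with the action selection rule (15) is constant and is bounded by

$$R(T) \le \sum_{i : \Delta_i > 0} \left(3 \Delta_i + \frac{16}{\Delta_i} \log\frac{2K}{\Delta_i \delta}\right),$$

where $i_*$ is the index of the optimal action.

Proof. Suppose the confidence intervals do not fail. If we play action $i$, the upper estimate of the action is above $\mu_*$, while its average is at most $\mu_i + c_{i,s}$. Hence,

$$c_{i,s} \ge \frac{\Delta_i}{2}.$$

Substituting $c_{i,s}$ and squaring gives

$$\frac{N_{i,s}^2}{1 + N_{i,s}} \le \frac{4}{\Delta_i^2} \left(1 + 2 \log\left(\frac{K (1 + N_{i,s})^{1/2}}{\delta}\right)\right).$$

By using Lemma 8 of Antos et al. (2010), we get that

$$N_{i,s} \le 3 + \frac{16}{\Delta_i^2} \log\frac{2K}{\Delta_i \delta}.$$

Thus, using $R(T) = \sum_{i \ne i_*} \Delta_i N_{i,T}$, we get that with probability at least $1 - \delta$, the total regret is bounded by

$$R(T) \le \sum_{i : \Delta_i > 0} \left(3 \Delta_i + \frac{16}{\Delta_i} \log\frac{2K}{\Delta_i \delta}\right).$$

□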



Remark 7. Lai and Robbins (1985) prove that for any suboptimal arm $j$,

$$\mathbb{E}[N_{j,t}] \ge \frac{\log t}{D(p_j, p_*)},$$

where $p_*$ and $p_j$ are the reward densities of the optimal arm and arm $j$ respectively, and $D$ is the KL-divergence. This lower bound does not contradict Theorem 6, as Theorem 6 only states a high probability upper bound for the regret. Note that UCB($\delta$) takes $\delta$ as its input. Because with probability $\delta$ the regret at time $t$ can be as large as $t$, in expectation the algorithm might have a regret of $t\delta$. Now if we select $\delta = 1/t$, then we get an $O(\log t)$ upper bound on the expected regret.

5 Application to Least Squares Estimation and Linear Bandit Problem 

In this section we first apply Theorem 3 to derive confidence intervals for least-squares estimation, where the covariate process is an arbitrary process, and then use these confidence intervals to improve the regret bound of Dani et al. (2008) for the linear bandit problem. In particular, our assumption on the data is as follows:

Assumption A1. Let $(\mathcal{F}_i)$ be a filtration, $(x_1, y_1), \dots, (x_t, y_t)$ be a sequence of random variables over $\mathbb{R}^d \times \mathbb{R}$ such that $x_i$ is $\mathcal{F}_i$-measurable, and $y_i$ is $\mathcal{F}_{i+1}$-measurable ($i = 1, 2, \dots$). Assume that there exists $\theta_* \in \mathbb{R}^d$ such that $\mathbb{E}[y_i \mid \mathcal{F}_i] = x_i^\top \theta_*$, i.e., $\varepsilon_i = y_i - x_i^\top \theta_*$ is a martingale difference sequence ($\mathbb{E}[\varepsilon_i \mid \mathcal{F}_i] = 0$, $i = 1, 2, \dots$) and that $\varepsilon_i$ is sub-Gaussian: There exists $R > 0$ such that for any $\gamma \in \mathbb{R}$,

$$\mathbb{E}[\exp(\gamma \varepsilon_i) \mid \mathcal{F}_i] \le \exp\left(\frac{\gamma^2 R^2}{2}\right).$$
We shall call the random variables $x_i$ covariates and the random variables $y_i$ the responses. Note that the assumption allows any sequential generation of the covariates.

Let $\hat\theta_t$ be the $\ell^2$-regularized least-squares estimate of $\theta_*$ with regularization parameter $\lambda > 0$:

$$\hat\theta_t = (X^\top X + \lambda I)^{-1} X^\top Y, \qquad \hat\theta_0 = 0, \qquad (16)$$

where $X$ is the matrix whose rows are $x_1^\top, \dots, x_{t-1}^\top$ and $Y = (y_1, \dots, y_{t-1})^\top$. We further let $\varepsilon = (\varepsilon_1, \dots, \varepsilon_{t-1})^\top$.
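In an online setting the estimate (16) need not be recomputed from scratch. The following sketch, an illustration rather than a prescribed implementation, maintains $(X^\top X + \lambda I)^{-1}$ with the Sherman-Morrison rank-one update; the function name and the toy data are assumptions.

```python
def ridge_estimate(xs, ys, lam):
    """Regularized least squares (16), built up by rank-one Sherman-Morrison updates."""
    d = len(xs[0])
    # Start from (lam I)^{-1} and an empty X^T Y.
    inv = [[(1.0 / lam if i == j else 0.0) for j in range(d)] for i in range(d)]
    b = [0.0] * d
    for x, y in zip(xs, ys):
        inv_x = [sum(inv[i][j] * x[j] for j in range(d)) for i in range(d)]
        denom = 1.0 + sum(x[i] * inv_x[i] for i in range(d))
        for i in range(d):
            for j in range(d):
                # (A + x x^T)^{-1} = A^{-1} - A^{-1} x x^T A^{-1} / (1 + x^T A^{-1} x)
                inv[i][j] -= inv_x[i] * inv_x[j] / denom
            b[i] += y * x[i]
    return [sum(inv[i][j] * b[j] for j in range(d)) for i in range(d)]

# One-dimensional toy data: theta_hat = (1 + 4) / (1 + 1 + 4) = 5/6 for lam = 1.
theta_1d = ridge_estimate([[1.0], [2.0]], [1.0, 2.0], lam=1.0)
# Noise-free two-dimensional data generated by theta_* = (2, -1), tiny regularization.
theta_2d = ridge_estimate([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [2.0, -1.0, 1.0], lam=1e-6)
print(theta_1d, theta_2d)
```

Each update costs $O(d^2)$ operations instead of the $O(d^3)$ needed for a fresh matrix inversion.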

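We are interested in deriving a confidence bound on the error of predicting the mean response $x^\top \theta_*$ at an arbitrarily chosen random covariate $x$ using the least-squares predictor $x^\top \hat\theta_t$. Using

$$\hat\theta_t = (X^\top X + \lambda I)^{-1} X^\top (X \theta_* + \varepsilon)$$
$$= (X^\top X + \lambda I)^{-1} X^\top \varepsilon + (X^\top X + \lambda I)^{-1} (X^\top X + \lambda I) \theta_* - \lambda (X^\top X + \lambda I)^{-1} \theta_*$$
$$= (X^\top X + \lambda I)^{-1} X^\top \varepsilon + \theta_* - \lambda (X^\top X + \lambda I)^{-1} \theta_*,$$

we get

$$x^\top \hat\theta_t - x^\top \theta_* = x^\top (X^\top X + \lambda I)^{-1} X^\top \varepsilon - \lambda x^\top (X^\top X + \lambda I)^{-1} \theta_* = \langle x, X^\top \varepsilon \rangle_{\bar V_t^{-1}} - \lambda \langle x, \theta_* \rangle_{\bar V_t^{-1}},$$

where $\bar V_t = X^\top X + \lambda I$. Note that $\bar V_t$ is positive definite (thanks to $\lambda > 0$) and hence so is $\bar V_t^{-1}$, so the above inner product is well-defined. Using the Cauchy-Schwarz inequality, we get

$$|x^\top \hat\theta_t - x^\top \theta_*| \le \|x\|_{\bar V_t^{-1}} \left(\|X^\top \varepsilon\|_{\bar V_t^{-1}} + \lambda \|\theta_*\|_{\bar V_t^{-1}}\right) \le \|x\|_{\bar V_t^{-1}} \left(\|X^\top \varepsilon\|_{\bar V_t^{-1}} + \lambda^{1/2} \|\theta_*\|\right),$$

where we used that $\|\theta_*\|^2_{\bar V_t^{-1}} \le 1/\lambda_{\min}(\bar V_t)\, \|\theta_*\|^2 \le 1/\lambda\, \|\theta_*\|^2$. Fix any $0 < \delta < 1$. By Corollary 1, with probability at least $1 - \delta$,

$$\forall t \ge 1, \qquad \|X^\top \varepsilon\|_{\bar V_t^{-1}} \le R \sqrt{2 \log\left(\frac{\det(\bar V_t)^{1/2} \det(\lambda I)^{-1/2}}{\delta}\right)}.$$

Therefore, on the event where this inequality holds, one also has

$$|x^\top \hat\theta_t - x^\top \theta_*| \le \|x\|_{\bar V_t^{-1}} \left(R \sqrt{2 \log\left(\frac{\det(\bar V_t)^{1/2} \det(\lambda I)^{-1/2}}{\delta}\right)} + \lambda^{1/2} \|\theta_*\|\right).$$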

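Similarly, we can derive a worst-case bound. The result is summarized in the following statement:

Theorem 8. Let $(x_1, y_1), \dots, (x_{t-1}, y_{t-1})$, $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$ satisfy the linear model Assumption A1 with some $R > 0$, $\theta_* \in \mathbb{R}^d$ and let $(\mathcal{F}_t)$ be the associated filtration. Assume that w.p.1 the covariates satisfy $\|x_i\| \le L$, $i = 1, 2, \dots$, and $\|\theta_*\| \le S$. Consider the $\ell^2$-regularized least-squares parameter estimate $\hat\theta_t$ with regularization coefficient $\lambda > 0$ (cf. (16)). Let $x$ be an arbitrary, $\mathbb{R}^d$-valued random variable. Let $\bar V_t = \lambda I + \sum_{i=1}^{t-1} x_i x_i^\top$ be the regularized design matrix underlying the covariates. Then, for any $0 < \delta < 1$, with probability at least $1 - \delta$,

$$\forall t \ge 1, \qquad |x^\top \hat\theta_t - x^\top \theta_*| \le \|x\|_{\bar V_t^{-1}} \left(R \sqrt{2 \log\left(\frac{\det(\bar V_t)^{1/2} \det(\lambda I)^{-1/2}}{\delta}\right)} + \lambda^{1/2} S\right). \qquad (17)$$

Similarly, with probability $1 - \delta$,

$$\forall t \ge 1, \qquad |x^\top \hat\theta_t - x^\top \theta_*| \le \|x\|_{\bar V_t^{-1}} \left(R \sqrt{d \log\left(1 + \frac{t L^2}{\lambda d}\right) + 2 \log\frac{1}{\delta}} + \lambda^{1/2} S\right). \qquad (18)$$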

Remark 9. We see that $\lambda \to \infty$ increases the second term (the "bias term") in the parenthesis of the estimate. In fact, $\lambda \to \infty$ for $t$ fixed gives $\lambda^{1/2} \|x\|_{\bar V_t^{-1}} \to \text{const}$ (as it should be). Decreasing $\lambda$, on the other hand, increases $\|x\|_{\bar V_t^{-1}}$ and the log term, while it decreases the bias term $\lambda^{1/2} S$.

From the above result, we immediately obtain confidence bounds for $\theta_*$:
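Corollary 10. Under the conditions of Theorem 8, with probability at least $1 - \delta$,

$$\forall t \ge 1, \qquad \|\hat\theta_t - \theta_*\|_{\bar V_t} \le R \sqrt{2 \log\left(\frac{\det(\bar V_t)^{1/2} \det(\lambda I)^{-1/2}}{\delta}\right)} + \lambda^{1/2} S.$$

Also, with probability at least $1 - \delta$,

$$\forall t \ge 1, \qquad \|\hat\theta_t - \theta_*\|_{\bar V_t} \le R \sqrt{d \log\left(1 + \frac{t L^2}{\lambda d}\right) + 2 \log\frac{1}{\delta}} + \lambda^{1/2} S.$$

Proof. Plugging in $x = \bar V_t (\hat\theta_t - \theta_*)$ into (17), we get

$$\|\hat\theta_t - \theta_*\|^2_{\bar V_t} \le \|\bar V_t (\hat\theta_t - \theta_*)\|_{\bar V_t^{-1}} \left(R \sqrt{2 \log\left(\frac{\det(\bar V_t)^{1/2} \det(\lambda I)^{-1/2}}{\delta}\right)} + \lambda^{1/2} S\right). \qquad (19)$$

Now, $\|\hat\theta_t - \theta_*\|^2_{\bar V_t} = \|\bar V_t (\hat\theta_t - \theta_*)\|^2_{\bar V_t^{-1}}$ and therefore either $\|\hat\theta_t - \theta_*\|_{\bar V_t} = 0$, in which case the conclusion holds, or we can divide both sides of (19) by $\|\hat\theta_t - \theta_*\|_{\bar V_t}$ to obtain the desired result. The second bound follows in the same way from (18). □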

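Remark 11. In fact, the theorem and the corollary are equivalent. To see this note that $x^\top (\hat\theta_t - \theta_*) = (\hat\theta_t - \theta_*)^\top \bar V_t^{1/2} \bar V_t^{-1/2} x$, thus

$$\sup_{x \ne 0} \frac{|x^\top (\hat\theta_t - \theta_*)|}{\|x\|_{\bar V_t^{-1}}} = \|\hat\theta_t - \theta_*\|_{\bar V_t}.$$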

Remark 12. The above bound could be compared with a similar bound of Dani et al. (2008) whose bound, under identical conditions, states that (with appropriate initialization) with probability $1 - \delta$,

$$\text{for all } t \text{ large enough}, \qquad \|\hat\theta_t - \theta_*\|_{\bar V_t} \le R \max\left\{\sqrt{128\, d \log(t) \log\left(\frac{t^2}{\delta}\right)},\ \frac{8}{3} \log\left(\frac{t^2}{\delta}\right)\right\}, \qquad (20)$$

where large enough means that $t$ satisfies $0 < \delta < t^2 e^{-1/16}$. Denote by $\beta_t(\delta)$ the right-hand side in the above bound. The restriction on $t$ comes from the fact that $\beta_t(\delta) \ge 2d(1 + 2 \log(t))$ is needed in the proof of the last inequality of their Theorem 5.

On the other hand, Theorem 2 gives rise to the following result: For any fixed $t \ge 2$, for any $0 < \delta < 1$, with probability at least $1 - \delta$,

$$\|\hat\theta_t - \theta_*\|_{\bar V_t} \le 2 \kappa^2 R \sqrt{\log t}\, \sqrt{d \log(t) + \log(1/\delta)} + \lambda^{1/2} S,$$

where $\kappa$ is as in Theorem 2. To get a uniform bound one can use a union bound with $\delta_t = \delta/t^2$. Then $\sum_{t=2}^{\infty} \delta_t = \delta \left(\frac{\pi^2}{6} - 1\right) \le \delta$. This thus gives that for any $0 < \delta < 1$, with probability at least $1 - \delta$,

$$\text{for all } t \ge 2, \qquad \|\hat\theta_t - \theta_*\|_{\bar V_t} \le 2 \kappa^2 R \sqrt{\log t}\, \sqrt{d \log(t) + \log(t^2/\delta)} + \lambda^{1/2} S.$$

This looks tighter than (20), but is still lagging behind the result of Corollary 10.

5.1 The Linear Bandit Problem

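We now turn our attention to the linear bandit problem. Assume the actions lie in $D \subset \mathbb{R}^d$ and for any $x \in D$, $\|x\|^2 \le L$. Assume the reward of taking action $x \in D$ has the form of

$$h_t(x) = \theta_*^\top x + \eta_t$$

and assume $\forall x \in D$, $\theta_*^\top x \in [-1, 1]$. Define the regret by

$$R(T) = \sum_{t=1}^{T} \left(\theta_*^\top x_* - \theta_*^\top x_t\right),$$

where $x_*$ is the optimal action ($x_* = \arg\max_{x \in D} \theta_*^\top x$). Define the confidence set

$$C_t(\delta) = \left\{\theta : (\theta - \hat\theta_t)^\top \bar V_t (\theta - \hat\theta_t) \le \beta_t(\delta)\right\}, \qquad (21)$$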
Input: Confidence $0 < \delta < 1$.

for $t := 1, 2, \dots$ do
    $(\tilde\theta_t, x_t) = \arg\max_{(\theta, x) \in C_t(\delta) \times D} \theta^\top x$.
    Play $x_t$ and observe reward $h_t(x_t)$.
    Update $\bar V_t$ and $C_t$.
end for

Table 1: The Linear Bandit Algorithm



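where

$$\beta_t(\delta) = \left(R \sqrt{2 \log\left(\frac{\det(\bar V_t)^{1/2} \det(\lambda I)^{-1/2}}{\delta}\right)} + \lambda^{1/2} S\right)^2$$

is the square of the radius given by Corollary 10.

Consider the ConfidenceBall algorithm of Dani et al. (2008). We use the confidence sets (21) and change the action selection rule accordingly. Hence, at time $t$, we define $\tilde\theta_t$ and $x_t$ by the following equation:

$$(\tilde\theta_t, x_t) = \arg\max_{(\theta, x) \in C_t(\delta) \times D} \theta^\top x. \qquad (22)$$

The algorithm is shown in Table 1.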
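For a finite action set the joint maximization (22) has a closed form, since maximizing $\theta^\top x$ over the ellipsoid $C_t(\delta)$ gives $\hat\theta_t^\top x + \sqrt{\beta_t(\delta)}\, \|x\|_{\bar V_t^{-1}}$ for each fixed $x$. The sketch below uses this to simulate the algorithm. It is an illustration under stated assumptions: a finite action set, Gaussian noise with standard deviation $R$, $\beta_t(\delta)$ taken as the squared radius of Corollary 10, and hypothetical names and parameter values throughout.

```python
import math
import random

def linear_bandit(actions, theta_star, horizon, lam=1.0, S=1.0, R=0.1, delta=0.05, seed=0):
    """Optimistic action selection over a finite action set; returns pull counts."""
    rng = random.Random(seed)
    d = len(theta_star)
    inv = [[(1.0 / lam if i == j else 0.0) for j in range(d)] for i in range(d)]  # V_bar^{-1}
    b = [0.0] * d                     # running sum of reward * action
    log_det_ratio = 0.0               # log det(V_bar_t) - log det(lam I)
    counts = [0] * len(actions)
    for _ in range(horizon):
        theta_hat = [sum(inv[i][j] * b[j] for j in range(d)) for i in range(d)]
        sqrt_beta = R * math.sqrt(log_det_ratio + 2.0 * math.log(1.0 / delta)) + math.sqrt(lam) * S

        def optimistic_value(x):
            inv_x = [sum(inv[i][j] * x[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x[i] * inv_x[i] for i in range(d)))
            return sum(theta_hat[i] * x[i] for i in range(d)) + sqrt_beta * norm

        idx = max(range(len(actions)), key=lambda a: optimistic_value(actions[a]))
        x = actions[idx]
        reward = sum(theta_star[i] * x[i] for i in range(d)) + rng.gauss(0.0, R)
        inv_x = [sum(inv[i][j] * x[j] for j in range(d)) for i in range(d)]
        w = sum(x[i] * inv_x[i] for i in range(d))
        log_det_ratio += math.log(1.0 + w)          # determinant identity (9)
        for i in range(d):
            for j in range(d):
                inv[i][j] -= inv_x[i] * inv_x[j] / (1.0 + w)   # Sherman-Morrison update
            b[i] += reward * x[i]
        counts[idx] += 1
    return counts

actions = [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)]
counts = linear_bandit(actions, theta_star=(0.9, 0.1), horizon=2000)
print(counts)
```

For general (infinite or continuous) action sets the maximization does not reduce to a finite search, which is the computational difficulty discussed in the next subsection.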

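Theorem 13. With probability at least $1 - \delta$, the regret of the Linear Bandit Algorithm shown in Table 1 satisfies

$$\forall T \ge 1, \qquad R(T) \le 4 \sqrt{T d \log(\lambda + TL/d)} \left(\lambda^{1/2} S + R \sqrt{2 \log 1/\delta + d \log(1 + TL/(\lambda d))}\right).$$

Proof. Let us decompose the instantaneous regret as follows:

$$r_t = \theta_*^\top x_* - \theta_*^\top x_t$$
$$\le \tilde\theta_t^\top x_t - \theta_*^\top x_t$$
$$= (\tilde\theta_t - \theta_*)^\top x_t$$
$$= (\hat\theta_t - \theta_*)^\top x_t + (\tilde\theta_t - \hat\theta_t)^\top x_t$$
$$\le 2 \sqrt{\beta_t(\delta)}\, \|x_t\|_{\bar V_t^{-1}}, \qquad (23)$$

where the last step holds by Cauchy-Schwarz. Using (23) and the fact that $r_t \le 2$, we get that

$$r_t \le 2 \min\left(\sqrt{\beta_t(\delta)}\, \|x_t\|_{\bar V_t^{-1}}, 1\right) \le 2 \sqrt{\beta_t(\delta)} \min\left(\|x_t\|_{\bar V_t^{-1}}, 1\right).$$

Thus, with probability at least $1 - \delta$, $\forall T \ge 1$,

$$R(T) \le \sqrt{T \sum_{t=1}^{T} r_t^2} \le \sqrt{8 \beta_T(\delta)\, T \sum_{t=1}^{T} \min\left(\|x_t\|^2_{\bar V_t^{-1}}, 1\right)} \le 4 \sqrt{\beta_T(\delta)\, T \log(\det(\bar V_T))}$$
$$\le 4 \sqrt{T d \log(\lambda + TL/d)} \left(\lambda^{1/2} S + R \sqrt{2 \log 1/\delta + d \log(1 + TL/(\lambda d))}\right),$$

where the last two steps follow from Lemma 4. □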
5.2 Saving Computation 

The optimization problem in the action selection rule (22) is NP-hard in general (Dani et al., 2008). In this section, we show that we essentially need to solve this problem only $O(\log t)$ times up to time $t$, hence saving computation. The algorithm in Table 2 achieves this objective by changing its policy only when the volume of the confidence set is halved, and it still enjoys almost the same regret bound as the algorithm in Table 1.

Theorem 14. With probability at least $1 - \delta$, $\forall T \ge 1$, the regret of the Linear Bandit Algorithm shown in Table 2 satisfies

$$R(T) \le 4 \sqrt{2 T d \log(\lambda + TL/d)} \left(\lambda^{1/2} S + R \sqrt{2 \log 1/\delta + d \log(1 + TL/(\lambda d))}\right) + 4 \sqrt{d \log(T/d)}.$$
Input: Confidence $0 < \delta < 1$.

$\tau = 1$ {This is the last timestep that we changed the action}
for $t := 1, 2, \dots$ do
    if $\det(\bar V_t) > 2 \det(\bar V_\tau)$ then
        $(\tilde\theta_t, x_t) = \arg\max_{(\theta, x) \in C_t(\delta) \times D} \theta^\top x$.
        $\tau = t$.
    end if
    $x_t = x_\tau$.
    Play $x_t$ and observe reward $h_t(x_t)$.
end for

Table 2: The Linear Bandit Algorithm
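The determinant-doubling test in Table 2 can be sketched as follows. This is a simplified illustration that only counts how often the test fires for a given sequence of played actions; the function name and the example sequence are assumptions, and the log-determinant is tracked with the rank-one identity (9).

```python
import math

def count_switches(xs, lam=1.0):
    """Number of rounds in which det(V_bar_t) exceeds twice its value at the last switch."""
    d = len(xs[0])
    inv = [[(1.0 / lam if i == j else 0.0) for j in range(d)] for i in range(d)]
    log_det = d * math.log(lam)            # log det(lam I)
    log_det_at_switch = log_det
    switches = 0
    for x in xs:
        if log_det > log_det_at_switch + math.log(2.0):
            switches += 1                   # here the algorithm would re-solve (22)
            log_det_at_switch = log_det
        inv_x = [sum(inv[i][j] * x[j] for j in range(d)) for i in range(d)]
        w = sum(x[i] * inv_x[i] for i in range(d))
        log_det += math.log(1.0 + w)        # log det grows by log(1 + ||x||^2 in V_bar^{-1} norm)
        for i in range(d):
            for j in range(d):
                inv[i][j] -= inv_x[i] * inv_x[j] / (1.0 + w)
    return switches

# One-dimensional example: the same unit action played for 1000 rounds.
switches = count_switches([[1.0]] * 1000, lam=1.0)
print(switches)
```

Since each switch requires the determinant to at least double, the number of switches is at most $\log_2(\det(\bar V_T)/\det(\lambda I))$, which by Lemma 4 grows only logarithmically in $T$.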



First, we prove the following lemma: 



Lemma 15. Let $A$, $B$ and $C$ be positive semi-definite matrices such that $A = B + C$. Then, we have that

$$\sup_{x \ne 0} \frac{x^\top A x}{x^\top B x} \le \frac{\det(A)}{\det(B)}.$$

Proof. We consider first a simple case. Let $A = B + m m^\top$, $B$ positive definite. Let $x \ne 0$ be an arbitrary vector. Using the Cauchy-Schwarz inequality, we get

$$(x^\top m)^2 = (x^\top B^{1/2} B^{-1/2} m)^2 \le \|B^{1/2} x\|^2 \|B^{-1/2} m\|^2 = \|x\|_B^2 \|m\|^2_{B^{-1}}.$$

Thus,

$$x^\top (B + m m^\top) x \le x^\top B x + \|x\|_B^2 \|m\|^2_{B^{-1}} = \left(1 + \|m\|^2_{B^{-1}}\right) \|x\|_B^2,$$

and so

$$\frac{x^\top A x}{x^\top B x} \le 1 + \|m\|^2_{B^{-1}}.$$

We also have that

$$\det(A) = \det(B + m m^\top) = \det(B) \det\left(I + B^{-1/2} m (B^{-1/2} m)^\top\right) = \det(B) \left(1 + \|m\|^2_{B^{-1}}\right),$$

thus finishing the proof of this case.

If $A = B + m_1 m_1^\top + \cdots + m_{t-1} m_{t-1}^\top$, then define $V_s = B + m_1 m_1^\top + \cdots + m_{s-1} m_{s-1}^\top$ and use

$$\frac{x^\top A x}{x^\top B x} = \frac{x^\top V_t x}{x^\top V_{t-1} x} \cdot \frac{x^\top V_{t-1} x}{x^\top V_{t-2} x} \cdots \frac{x^\top V_2 x}{x^\top B x}.$$

By the above argument, since all the terms are positive, we get

$$\frac{x^\top A x}{x^\top B x} \le \frac{\det(V_t)}{\det(V_{t-1})} \cdot \frac{\det(V_{t-1})}{\det(V_{t-2})} \cdots \frac{\det(V_2)}{\det(B)} = \frac{\det(V_t)}{\det(B)} = \frac{\det(A)}{\det(B)}.$$

This finishes the proof of this case.

Now, if $C$ is a positive semi-definite matrix, then the eigendecomposition of $C$ gives $C = U^\top \Lambda U$, where $U$ is orthonormal and $\Lambda$ is a nonnegative diagonal matrix. This, in fact, gives that $C$ can be written as the sum of at most $d$ rank-one matrices, finishing the proof for the general case.

□
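Lemma 15 can be checked numerically in dimension two by scanning unit vectors. The matrices below are hypothetical, chosen only for illustration, with $A$ built from $B$ by adding three random rank-one terms.

```python
import math
import random

def quad(a, x):
    """Quadratic form x^T a x for a symmetric 2x2 matrix a."""
    return a[0][0] * x[0] ** 2 + 2.0 * a[0][1] * x[0] * x[1] + a[1][1] * x[1] ** 2

def det2(a):
    """Determinant of a 2x2 matrix."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]

rng = random.Random(1)
b = [[2.0, 0.5], [0.5, 1.0]]                        # a positive definite B
a = [row[:] for row in b]
for _ in range(3):                                   # A = B + sum of rank-one terms
    m = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    for i in range(2):
        for j in range(2):
            a[i][j] += m[i] * m[j]

det_ratio = det2(a) / det2(b)
# The ratio of quadratic forms is scale invariant, so unit vectors on a half circle suffice.
worst_ratio = max(
    quad(a, x) / quad(b, x)
    for x in ([math.cos(k * math.pi / 180.0), math.sin(k * math.pi / 180.0)] for k in range(180))
)
print(worst_ratio, det_ratio)
```

The grid only samples directions at one-degree steps, so this is a plausibility check rather than a proof.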



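Proof of Theorem 14. Let $\tau_t$ be the smallest timestep $\le t$ such that $x_t = x_{\tau_t}$. By an argument similar to the one used in Theorem 13, we have

$$r_t \le (\hat\theta_{\tau_t} - \theta_*)^\top x_t + (\tilde\theta_{\tau_t} - \hat\theta_{\tau_t})^\top x_t.$$

We also have that for all $\theta \in C_{\tau_t}$ and $x$,

$$|(\theta - \hat\theta_{\tau_t})^\top x| \le \left\|\bar V_t^{1/2} (\theta - \hat\theta_{\tau_t})\right\| \sqrt{x^\top \bar V_t^{-1} x}$$
$$\le \left\|\bar V_{\tau_t}^{1/2} (\theta - \hat\theta_{\tau_t})\right\| \sqrt{\frac{\det(\bar V_t)}{\det(\bar V_{\tau_t})}}\, \sqrt{x^\top \bar V_t^{-1} x}$$
$$\le \sqrt{2}\, \left\|\bar V_{\tau_t}^{1/2} (\theta - \hat\theta_{\tau_t})\right\| \sqrt{x^\top \bar V_t^{-1} x},$$

where the second step follows from Lemma 15 and the third step follows from the fact that at time $t$ we have $\det(\bar V_t) \le 2 \det(\bar V_{\tau_t})$. The rest of the argument is identical to that of Theorem 13. We conclude that with probability at least $1 - \delta$, $\forall T \ge 1$,

$$R(T) \le 4 \sqrt{2 T d \log(\lambda + TL/d)} \left(\lambda^{1/2} S + R \sqrt{2 \log 1/\delta + d \log(1 + TL/(\lambda d))}\right).$$

□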



5.3 Problem Dependent Bound ($\Delta > 0$)

Let $\Delta$ be as defined in (Dani et al., 2008). In this section we assume that $\Delta > 0$. This includes the case when the action set is a polytope. First we state a matrix perturbation theorem from Stewart and Sun (1990) that will be used later.

Theorem 16 (Stewart and Sun (1990), Corollary 4.9). Let $A$ be a symmetric matrix with eigenvalues $\nu_1 \ge \nu_2 \ge \cdots \ge \nu_d$, $E$ be a symmetric matrix with eigenvalues $e_1 \ge e_2 \ge \cdots \ge e_d$, and $V = A + E$ denote a symmetric perturbation of $A$ such that the eigenvalues of $V$ are $\tilde\nu_1 \ge \tilde\nu_2 \ge \cdots \ge \tilde\nu_d$. Then, for $i = 1, \dots, d$,

$$\tilde\nu_i \in [\nu_i + e_d,\ \nu_i + e_1].$$

Theorem 17. Assume that $\Delta > 0$ for the gap $\Delta$ defined in (Dani et al., 2008). Further assume that $\lambda \ge 1$ and $S \ge 1$. With probability at least $1 - \delta$, $\forall T \ge 1$, the regret of the algorithm shown in Table 1 satisfies

$$R(T) \le \frac{64 R^2 \lambda S^2}{\Delta} \left(\log(LT) + (d - 1) \log\frac{64 R^2 \lambda S^2 L}{\Delta^2} + 2 (d - 1) \log\left(d \log\frac{d\lambda + T L^2}{d} + 2 \log(1/\delta)\right) + 2 \log(1/\delta)\right)^2.$$

Proof. First we bound the regret in terms of $\log\det(\bar V_T)$. We have that

$$R(T) = \sum_{t=1}^{T} r_t \le \sum_{t=1}^{T} \frac{r_t^2}{\Delta} \le \frac{16 \beta_T}{\Delta} \log(\det(\bar V_T)), \qquad (24)$$

where the first inequality follows from the fact that either $r_t = 0$ or $\Delta < r_t$, and the second inequality can be extracted from the proof of Theorem 13. Let $b_t$ be the number of times we have played a sub-optimal action (an action $x_s$ for which $\theta_*^\top x_* - \theta_*^\top x_s \ge \Delta$) up to time $t$. Next we bound $\log\det(\bar V_t)$ in terms of $b_t$. We bound the eigenvalues of $\bar V_t$ by using Theorem 16.

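Let $E_t = \sum_{s : x_s \ne x_*} x_s x_s^\top$ and $A_t = \bar V_t - E_t = (t - b_t) x_* x_*^\top$. The only non-zero eigenvalue of $(t - b_t) x_* x_*^\top$ is $(t - b_t) L_*$, where $L_* = x_*^\top x_* \le L$. Let the eigenvalues of $\bar V_t$ and $E_t$ be $\lambda_1 \ge \cdots \ge \lambda_d$ and $e_1 \ge \cdots \ge e_d$ respectively. By Theorem 16, we have that

$$\lambda_1 \in [(t - b_t) L_* + e_d,\ (t - b_t) L_* + e_1] \quad \text{and} \quad \forall i \in \{2, \dots, d\},\ \lambda_i \in [e_d, e_1].$$

Thus,

$$\det(\bar V_t) = \prod_{i=1}^{d} \lambda_i \le ((t - b_t) L_* + e_1)\, e_1^{d-1} \le ((t - b_t) L + e_1)\, e_1^{d-1}.$$

Therefore,

$$\log\det(\bar V_t) \le \log((t - b_t) L + e_1) + (d - 1) \log e_1.$$

Because $\mathrm{trace}(E_t) = \sum_{s : x_s \ne x_*} \mathrm{trace}(x_s x_s^\top) \le L b_t$, we conclude that $e_1 \le L b_t$. Thus,

$$\log\det(\bar V_t) \le \log((t - b_t) L + L b_t) + (d - 1) \log(L b_t) = \log(Lt) + (d - 1) \log(L b_t). \qquad (25)$$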
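With some calculations, we can show that

$$\beta_t \log\det \bar V_t \le 4 R^2 \lambda S^2 \left(2 \log(1/\delta) + \log\det \bar V_t\right)^2 \le 4 R^2 \lambda S^2 \left(d \log\frac{d\lambda + t L^2}{d} + 2 \log\frac{1}{\delta}\right)^2, \qquad (26)$$

where the second inequality follows from Lemma 4. Hence,

$$b_t \le \frac{16 \beta_t}{\Delta^2} \log(\det(\bar V_t)) \le \frac{64 R^2 \lambda S^2}{\Delta^2} \left(d \log\frac{d\lambda + t L^2}{d} + 2 \log\frac{1}{\delta}\right)^2, \qquad (27)$$

where the first inequality follows from $R(t) \ge b_t \Delta$. Thus, with probability $1 - \delta$, $\forall T \ge 1$,

$$R(T) \le \frac{16 \beta_T}{\Delta} \log(\det(\bar V_T))$$
$$\le \frac{64 R^2 \lambda S^2}{\Delta} \left(\log(\det(\bar V_T)) + 2 \log(1/\delta)\right)^2$$
$$\le \frac{64 R^2 \lambda S^2}{\Delta} \left(\log(LT) + (d - 1) \log(L b_T) + 2 \log(1/\delta)\right)^2$$
$$\le \frac{64 R^2 \lambda S^2}{\Delta} \left(\log(LT) + (d - 1) \log\frac{64 R^2 \lambda S^2 L}{\Delta^2} + 2 (d - 1) \log\left(d \log\frac{d\lambda + T L^2}{d} + 2 \log(1/\delta)\right) + 2 \log(1/\delta)\right)^2,$$

where the first step follows from (24), the second step follows from the first inequality in (26), the third step follows from (25), and the last step follows from the second inequality in (27). □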



Remark 18. The problem dependent regret of (Dani et al., 2008) scales like $O\left(\frac{d^2}{\Delta} \log^3 T\right)$, while our bound scales like $O\left(\frac{1}{\Delta} (\log^2 T + d \log T + \log\log T)\right)$.